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Abstract 


Nowadays, represented by Deep Learning techniques, the field of ma¬ 
chine learning is experiencing unprecedented prosperity and its influence 
is demonstrated in academia, industry and civil society. “Intelligent” 
has become a label which could not be neglected for most applications; 
celebrities and scientists also warned that the development of full artifi¬ 
cial intelligence may spell the end of the human race. It seems that the 
answer to building a computer system that could automatically improve 
with experience is right on the next corner. While for AI or machine 
learning researchers, it is a consensus that we are not anywhere near the 
core technique which could bring the Terminator, Number 5 or R2D2 into 
real life, and there is not even a formal definition about what is intelli¬ 
gence, or one of its basic properties: Learning. Therefore, even though 
researchers know these concerns are not necessary currently, there is no 
generalised explanation about why these concerns are not necessary, and 
what properties people should take into account that would make these 
concerns to be necessary. In this paper, starts from analysing the relation 
between information and its representation, a necessary condition for a 
model to be a learning model is proposed. This condition and related 
future works could be used to verify whether a system is able to learn or 
not, thus enriched our understanding of learning: one important property 
of Intelligence. 



1 Introduction 


Although machine learning or artificial intelligence researchers know that we are 
not anywhere near the technique which could enable us to build a machine with 
real intelligence, almost all discussion about various properties of Intelligence 
stay on a philosophical level, such as the ability of learning, the ability of logical 
reasoning, self-consciousness, or even emotion. Formal mathematical definitions 
of these properties are very undefined, and we do not even have any binary 
criteria to verify whether a system really possesses any of these abilities, let 
alone formal methods which can be used to analyse the level of each of these 
abilities. 

By generalising our intuitive understanding of learning ability, this paper 
gives a concise necessary condition for being a learning model. And this pa¬ 
per also presents further research directions which would enable us to analyse 
whether a model will be a learning model or non-learning model during the 
development process. 

The discussion starts with intuitive introductions that prepare readers with 
conceptual understanding of the viewpoint of the following formal discussion. 

1.1 Intuitive introduction: Information 

We are surrounded and processing various information constantly. And cer¬ 
tainty plays a critical role in representing information. Because one piece of 
information must be carried by at least one certain representation. 

• “I’m 188cm” : in this example a certain number together with a unit (cm) 
represents the information on height, the first person pronoun represents 
an identity related to the height information. 

• “Turn left ? Right!” : in this example, the uncertainty of context brings 
ambiguity. 

Even when we are measuring the uncertainty of a system (entropy), we will 
get a certain number or a range with certain boundary; or if we cannot get a 
certain boundary, we still have the certain representation of the system we are 
measuring; or even if we do not know what system we are measuring, we still 
have a certain definition of entropy; and if we have no certainty about anything, 
then there is no information at all. In other words, the amount of certainties also 
equivalent to the amount of information. Therefore, we can summary Feature 
One of information as follows: 

One piece of information is equivalent to at least one certain 
representation, and vice versa. 

1.2 Intuitive introduction: Learning 

We as human cannot sense vast amount of physical realities naturally, such as 
the existence of almost all submicroscopic particles, the existence of majority 
region of electromagnetic wave, the movement of earth, the fact that the Earth is 
a sphere, actually this list could keep growing and includes almost all knowledge 
we have learned since the born of modern science. And we are all very familiar 
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with the process of learning new knowledge based on known knowledge, and 
based on new learned knowledge to learn even newer knowledge. Firstly, this 
paper will give a generalised definition of the relative relation between the known 
knowledge and new learned knowledge. The introduction of this definition starts 
with an example: 

A man was charged with murder, judge or juror has no information about the 
reality of innocent or guilty. A reasonable sentence must be the consequence of 
implementing proper methods on all evidences. 

The evidence is local information, and the reality of innocent or guilty is 
global information. So the same as the known knowledge (local informa¬ 
tion) and new learned knowledge (global information) as introduced above. 
Actually, the notion of “no observation independent reality ” has been 
widely accepted for decades. The most important thing is, the global and local 
relation not only exists in learning high level abstract concept, but also exists 
in learning very fundamental concepts, such as object identification. Because 
we have no direct access to the world p.l8] other than though our sensors. 
The information of the existence of an object (global information) is the 
consequence of learning based on its different possible appearances (local in¬ 
formation) received by our retina. And Feature One of information tells us 
there must be a certain representation of this object (global information). 

Further discussion of learning (section 3) is after a formal discussion about 
information (section 2). These sections are all based on the explanation of 
mapping relations from a new viewpoint. 
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2 Information 

2.1 Information Generator and Representation 
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Figure 1: Mapping Relation and Generators 

Suppose there is a mapping relation between a set X and a set Y, so any subset 
of X can be regarded as being generated by a generator, and the corresponding 
elements of subset of Y are representations of this generator. The joint of any 
two subsets of X, or any two corresponding subsets of Y are not necessarily to 
be null. 

2.2 Domain and Range of a Mapping Relation 

When a mapping relation F is defined between domain D and range O. Then, 
with respect to F^ any subset of its domain could correspond to an unknown 
generator, and the corresponding elements in the range of that domain are the 
representation of that unknown generator. 
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Figure 2: Mapping relation between domain and range 

Then, with respect to F (could be a function), we can define local informa¬ 
tion and global information as follows: 

• Local information: Any input a: S I? is a local information with respect 
to F. 

• Global information: Any element of subset Y d O and the corresponding 
subset X = {x\y GY,y = F{x)} are global information. 

So the global information y G Y,Yn G O represents an unknown generator 
Gn- It is obvious that an unknown information generator G„ could have a 
lot global representations: all elements of T„, and due to the Feature One 
of information, we can say that all elements of Yn should follows a certain 
constraint c. And if the corresponding elements of domain subset A„ do not 
follow the same constraint c, we can say that the global representation {y G Yn) 
of the information generator G„ is the invariant representation with respect 
to its appearances (Vx G A„). 

Example One: 

• Domain: A CCD array, or our retina all provide a very large domain. If 
the value of each pixel is between 0-255 and the size of the image is 
200*200 pixels, than the corresponding size of the domain is 256^°°°°, 
which is a very large number. 

• Local information: Every image of a dog, every possible moment 
captured by our retina, these are all local information. 

• Global information: The existence of the dog, and all its appearances are 
global information. But we can learn the invariant representation form 
local information, a camera cannot. 
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2.3 Information (generator) Type 

It is obvious that different mapping relation pairs different subsets between 
domain and range. 




Figure 3: Different type of information 

Therefore, we can say that a mapping relation along with its domain de¬ 
fine the type of unknown generator G„, or in other words defines the type of 
information. 

2.4 Common constraintiLinear separable hypersurface 

As introduced in section 2.2, global information must follow a certain constraint 
c, so the certainty of information can be guaranteed. There are a large number 
of possible constraints, and we denote the set of all possible constraint as C. 
Now we would like to know which constraint is preferable? and Why? The 
following discussion gives the answer of this two questions. 

Assumption One: 

Within certain type of information, there exists a mapping 
relationship that can define the differences between information 
generators. 

Based on Assumption One and Feature One of information, we know 
that a mapping relation, denoted as Fc-, map different subsets generated by 
unknown generators to global information that distinguishable by applying con¬ 
straint cr. By choosing the simplest constraint cr that each y € O represent a 
distingusiable information generator, so the expression of Fc is: 


Fc:D^ 0{D cR^,Oc R) 


( 1 ) 
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According to this expression, the differences of information generators can be 
identified as follow: 

A set Xi C -D is the global information of an unknown information generator 
Gi, then yi = Fc(x) : x € G R ; another set A 2 C -D is the global 

information of an unknown information generator G 2 , then we have 
2/2 = Fcix) : X e A2,2/2 G R- 

If yi 7 ^ 2 / 2 j then we can say that Gi and G 2 are different information generators, 
and the type of information is defined by Fq. 
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Figure 4: Representation and the Common Constraint 

Furthermore, for real numbers in range O, they can be represent by paral¬ 
lel hyperplanes defined by a vector set Vc, so we can say that points on these 
hyperplanes are population representations (global information) of different 
unknown information generators, and this also explain why researchers prefer 
the property of linear separable so much, because it is not only easy understand¬ 
ing, but also is the common constraint that every element of constraint set G 
can convert to. 


3 Learning 

Previous section shows a formal way of explaining local information and its 
global representation. What being learned is nothing more than information, 
or in other words, is nothing more than global representation, so the analysis 
of learning ability should based on the analysis of the mapping relation as in- 

^The term “population representation in figure 4 means the invariant representation is high 
dimensional 
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troduced in previous section. And again this section starts with an intuitive 
example. 

Example Two: 

(A) select dob from tablel where name = “wu hao” 

(B) y = 2x, when x = 2, we have y = 4 

(C) P(is human | a picture of me) = 0.9999 

(D) People see picture 1, they know it is a pig. People see picture 2, they 
know it is a dragon. People see picture 3, even though they may have 
never seen this before and do not know the name, they still know it is 
different. 



1 2 3 


Figure 5: Three Pictures 

These four models in example two all include one input, a mapping relation 
(function), and one output. A is a database, B is a linear function, C is a 
powerful probabilistic model, and D is us. By analysing example two from the 
mapping relation point of view we know that: 

(A) : The database will return NULL for all inquiries that do not match what 
have been stored previously. 

(B) : Outputs will be different when the inputs are different. And for almost 
all regression problems, we have assumptions about the mapping relation 
between two subsets of our observations, and after modelling our hypoth¬ 
esis of the mapping relation, for a new set of input we can almost always 
expect the output set contains new element which we have never seen 
before. 

(C) : When applying probabilistic model, our hypothesis is based on the be¬ 
lief that there are unknown statistical laws which represent different con¬ 
cepts . Assume mapping relation Fmodei is able to mimic the unknown 
statistical law of the appearance of a cup perfectly so the mapping 
relation we get is as follow: 

^This assumption helps to avoid the discussion of convergence problem. 
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Figure 6: Probabilistic Model (example one) 

At this stage, F^odei is able to give the chance of being a cup for any 
element of the domain. What if we want Fmodei to have the ability to give 
the chance of being other object, such as a dog. Different from what we 
observed in regression problem, there is no real number naturally related 
to different objects. Therefore we chose another way of constructing our 
hypothesis of observation as follow: 



F_modol 
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Figure 7: Probabilistic Model two 
Fmodei ■ —>■ R X is indicator set) 

But the introduction of this “Indicator Set” brings a problem. The size 
of this indicator set represent the information of the amount of objects 
Fmodei can effectively identify. This is a global information (denoted as 
#J) with respect to identities of each object, and it is given by us. For 
objects which are not included in our hypothesis of Fmodei i the outputs 
are almost all clustered around 0. 

(D) We, as human with well developed vision, are always able to identify object 
which have never been seen before, or in other words, the function of our 
vision system are able to define new global information. 

























3.1 Necessary condition of Learning 

Therefore, combining the discussion of information and its representation with 
above discussion, the intuitive understanding of the ability of learning is: 

Learning means the ability to define new information generators 

From this point of view, B and D are learning model, A and C are non¬ 
learning model. And by generalising the description above, we have a formal 
necessary condition of a mapping relation Fmodei to be a learning model as 
follow: 

Condition S: 

yXs C D,Ys = FjnodeliXs),3Xpf C D \ Xs : Yn = FmodeliX^), (Y/v A Yg ^ 

YAr)&(YAr A Yg 7 ^ Yg) 

For any subset Ag of the domain D, the corrsponding subset of range is Yg, 
exists a set X^ which is a subset of the complement set of Ag, so that we 
have a subset Yjv = FAodei(Ajv), the symmetric differences of Yjv and Yg is 
neither Yg nor Y^. 

Generally, if a mapping relation Fmodei satisfies Condition S, then we can 
say Fmodei is able to learn its domain and it is a learning model, otherwise 
it is not able to learn its domain and it is a non-learning model. But there 
are actually two strategies which enable a non-learning model Fmodei to satisfy 
condition S, thus behaves like it is able to learn its domain. 

(A) Shrinking the scale of the domain. 

(B) Iterating over the domain 


Example Three: 
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Figure 8 : Strategies A and B 

• As shown in figure 8 , in picture 1, function F* is defined on domain D, 
and it is not able to learn its domain because it does not satisfy 
condition S. But by shrinking its domain as shown in picture 2, F* 
behaves like it is able to learn most of its domain. This is strategy A 

• As shown in figure , in picture 5, a set of function {Fi, F 2 , F 3 , F 4 } forms 
an approximation of function F^ which satisfies condition S over its 
domain, but neither of these four functions satisfy condition S. This is 
strategy B. 

Usually these two strategies are being applied together. 

For neutralising the effect brought by strategy A and B, condition S needs 
to be extended a little bit. 
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Condition S*: 

Suppose domain D is defined on N dimensional space, extend the scale of 
dimension M of the domain to infinite and mapping relation Fmodei still 
satisfies condition S, then we can say Fmodei is a learning model on domain D, 
dimension M. 

If Fmodei satisfies condition S* on every dimension of its domain Z), then 
we can say Fmodei is able to learn its domain Z), and it is a complete learning 
model. On the other hand, as shown in figure 8 picture 5, the iteration process 
behaves similar with a database systems as shown in picture 3 and 4. And a 
database system is a typical memory system, so we can also say: 


• Not being able to satisfy condition S* is a sufficient condition 
for a model to be a Non-Learning model. 

• Non-Learning model and memory system are equivalent. 

3.2 AI Effect 

Fmodei contains our hypothesis about the mapping relation between different 
subsets of our observation. The ideal situation is our hypothesis could be as 
close to the unknown mapping relation as possible. One method of measuring 
the validity of our hypothesis is as follow: 

Suppose unknown mapping relation is U ■. D ^ O, and our guess if H. So for 
any o C O, we have T = U{o), and G = F[{o), then J{T, G) |^is the validity of 
our hypothesis. 

For simple problems, such as regression problem, our hypothesis works fine , 
but its only learn simple mapping relation. But when facing complex optimisa¬ 
tion problems, for getting a desired mapping relation Fmodei, strategies A and B 
are usually being applied together and if researchers do not exam the property 
of information being used carefully, it is possible that information comes from 
outside of this structure can be introduced. Usually the choice of introducing 
information from outside the structure will help to reduce the difficulties of 
constructing a desired Fmodei or improve its performance. But this behaviour 
will eventually cause Fmodei fail to satisfy condition S* , as shown by the exam¬ 
ple of figure 7 (the probabilistic model).This somehow explains Rodney Brooks 
complain: 

“ Every time we figure out a piece of it, it stops being magical; we 
say, ’Oh, that’s just a computation ” 


4 Problems and Feature Works 

1. More necessary conditions or a sufficient condition 

^ J{T,G) is the Jaccard similarity coefficient of these two sets, It can also being used to 
define the simplicity of a machine learning problem 
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In this paper, condition S* is a necessary condition for a mapping relation 
to be a learning model, this does not rule out the possibility that some 
mapping relation Mp could satisfy condition S* but still against our un¬ 
derstanding of learning. Therefore, further discussion about condition on 
mapping relation is necessary. 

2. Necessary condition on structural level 

The discussion of previous section indicates that Condition S* of mapping 
relation is insufficient for analysing the structural detail of a given model, 
necessary condition on structural level is required to guide the construction 
process of getting a desired mapping relation and presumably would also 
give us a formal explanation about why introducing global information 
from outside the structure could make the problem easy to be solved or 
improve the performance but also causes losing the ability of learning at 
the same time. 

3. Verifying state-of-the-art artificial intelligence systems. 

Because of the overheated expectation of an realistic AI system and the 
following failure in 1970s, researchers usually avoid using the term “AI” 
since then. Instead they express the similar idea by implying that their 
systems are able to learn similar features as human brain could do or be 
able to exhibit unexpected behaviours. 

These days, some systems have demonstrated impressive performances 
which not only make the discussion of AI a hot topic again but also raised 
the alert over the possibility of people may losing control of Artificial 
Intelligence. But no matter how powerful these systems could be, from 
machine learning point of view, they are still solutions of optimisation 
problems which based on different hypothesises of our observation. There¬ 
fore their ability of learning is independent from their performance and 
can be verified in the same way as the discussion of probabilistic model in 
section 3. The research results of problem one and two will provide more 
powerful tools for verifying the learning ability of a give model. 

4. The common learning model 

As discussed in section 2.4, linear separable is the common constraint 
that all constraints can convert to, therefore all learn models (include us) 
can convert to the common learning model which follows this common 
constraint. A prototype which demonstrates the common learning model 
theory is the zero to one step for harvesting information learned by future 
artificial intelligence system. 

5. Begin with object recognition 

The ability of learning is one property of an intelligent system, the learning 
result is not necessary to be right all the time, and the concept of “right” 
is also a global information which needs to be defined in future research. 

The process of learning highly abstract information is tightly related to 
other part of intelligence, after all, during the past 300000 years’ devel¬ 
opment of homo sapiens, most of our learned highly abstract realities are 
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“Wrong” and comparing with realities we know today, it seems “overfit¬ 
ting” is a common phenomenon of the highly abstract information learn¬ 
ing process (World Elephant, then Geocentric Theory). Therefore, further 
study of learning will focus on the learning process of relatively indepen¬ 
dent and low level abstract information, such as the object recognition 
problem. 

6. Dataset separation and merge problem 

When explaining the object recognition problem using the theoretical 
framework proposed in previous sections, two seemingly counterintuitive 
deductions are dataset separation and merge problems. 

• Dataset separation problem: When dataset generated by one object 
is separated , there could be two different set of invariant represen¬ 
tations. 

• Dataset merge problem: When two dataset of different object are 
merged together, there could be a new set of invariant representa¬ 
tions. 



Figure 9: Dataset separate and merge phenomenon 


The description above is the common intuitive understanding of these two 
problem, but there are two mistakes about this understanding: 

• The information generators is defined by the mapping relation, so 
these two problems are equivalent. 

• Since there is no observation independent reality, this is not a prob¬ 
lem, it is a phenomenon which can be used to verify the validity of 
future implementation based on this theoretical framework. 

Note 
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It seems particularly hard for people to understand the dataset merge 
phenomenon in object recognition. It indicates that if we could combine 
two objects together, such as a dog and a cat, and guarantee the 
smoothness of the transformation of their appearances, then we will see 
them as a same thing(things). This confused conclusion exists only 
because we have the information of the existence of these two 
distinguishable objects at first, and there is no real world experience 
which bring us into this mental experiment. But we can always almost 
recognise “Optimus Prime”, “Bumblebee”, “Jazz” and other characters 
in transformers (the movie or the cartoon), no matter they appear as 
cars or human form robots. In this scenario, the recognition process of a 
transformer is no more difficult than recognise anything else and we take 
it for granted. Therefore, the confusion about cat and dog could be 
completely solved someday in the future which “Marvel” or “DC” decide 
to bring us a new superhero whose appearance transforms between cat 
and dog. And this phenomenon should be able to be verified 
experimentally if there are corresponding neuroscientific approaches. 

7. Other related topics 

With the progress of future works, some other related topics will help 
to provide a better understanding about the essence of learning and its 
relation with intelligence. These topics include: detailed analysis of non¬ 
learning model, formal discussion of overfitting problem in the process of 
learning highly abstract information, strategies that could make a learning 
model behaves like a non-learning model. 


5 Conclusion 

John Connor asked T-800: 

“Can you learn stuff that you haven’t been programmed with? so you could 
be...you know, more human? And not such a dork all the time?” 

Even though that is just a scenario of a movie, still we all want to known 
the answer of this question. And for researchers, the most interesting part is 
how this question can be answered. 

Machine Learning theories focus on solving different optimisation problems, 
so we could model hypothesises of our observation. While, the discussion of 
learning should focus on the behaviour of the model. In this paper, by analysing 
the relations among observable appearance(local information), reality (global 
information), their representation and the generalisation of our intuitive under¬ 
standing of learning, we have a necessary condition S* which can be used to get 
the answer of John’s question. 

Starting from a viewpoint which has been missed out, the analysis in this re¬ 
port shows that. Learning is not an action of absorbing existed information. In 
contrast, learning is an action of define information generators, or more specif¬ 
ically, is the ability of always being able to define new information generators. 
And preliminary verification indicates that our hypothesis of what we observed 
could be a learning model which only learn trivial information, and when facing 
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complex problems lack of learning ability is a trade-off for reducing the com¬ 
plexity or improving the performance through introducing global information 
from outside the structure. 

It is possible in the future that we could harvest the information generated 
by a learning model which can learn like us but with greater learning ability. 
And focusing on the direction of getting a common learning model which can 
learn like us, this paper illustrates the logic dependencies of many topics that 
related to understanding the essence of learning. 


^Complexity as defined in section 3.2. 
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